HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Recognizing the activities of the person
wearing a camera using advanced deep
neural networks
CUONG NGUYEN HUNG
nhcuongit@gmail.com
Ambient Computing, Multimedia and Interaction
Submitted by:
Cuong Nguyen Hung
Advisor:
Assoc. Prof. Dr. Hai Vu
Faculty:
School of Electrical and Electronics Engineering
HANOI, 10/2023
TABLE OF CONTENTS
LIST OF ABBREVIATIONS ............................................................................. 4
LIST OF FIGURES ............................................................................................. 5
LIST OF TABLES ............................................................................................... 7
CHAPTER 1. INTRODUCTION ....................................................................... 8
1.1 General background ................................................................................... 8
1.2 Objectives ................................................................................................... 9
1.3 Thesis outline ............................................................................................. 9
CHAPTER 2. RELATED WORKS ................................................................. 11
2.1 Research on Attention architectures ........................................................ 11
2.2 Egocentric vision for activity recognition ................................................ 13
2.3 Discussions ............................................................................................... 15
CHAPTER 3. FUNDAMENTALS OF NEURAL NETWORKS .............. 17
3.1 Residual Network (ResNet) architecture ................................................. 17
3.2 Attention-based network architecture ...................................................... 18
3.2.1 Convolutional Block Attention Module (CBAM) .................... 19
3.2.2 Polarized Self-Attention (PSA) ................................................. 21
3.3 YOLO family object detection models .................................................... 24
CHAPTER 4. PROPOSED METHOD ............................................................ 26
4.1 General framework................................................................................... 26
4.2 Collecting data ......................................................................................... 26
4.3 Detecting and tracking the patient's hand ................................................ 27
4.4 Processing the input data for the ResNet network from the Hand tracking
results 28
4.5 The activity recognition model using an Attention-based architecture ... 32
4.6 Deploying the application for detecting the rehabilitation exercises ....... 32
4.6.1 Application architecture ............................................................ 33
4.6.2 List of actors.............................................................................. 33
4.6.3 List of use cases ........................................................................ 33
CHAPTER 5. EXPERIMENTS ........................................................................ 37
5.1 Environment and model evaluation metrics............................................. 37
5.2 Quantitative Evaluation ............................................................................ 37
5.3 Visual Evaluation ..................................................................................... 41
5.4 Evaluating the results on the collected videos ......................................... 42
5.5 Application ............................................................................................... 45
5.5.1 Application functions ................................................................ 45
5.5.2 Design application UI ............................................................... 46
CHAPTER 6. CONCLUSION .......................................................................... 47
REFERENCES ................................................................................................... 48
LIST OF ABBREVIATIONS
Abbreviation    Meaning
FRE             Functional recovery exercises
EAR             Egocentric activity recognition
bbox            Bounding box
SENet           Squeeze-and-Excitation Network
CNN             Convolutional neural network
DNN             Deep neural network
GCNet           Global Context Network
NLNet           Non-Local Network
DANet           Dual Attention Network
CBAM            Convolutional Block Attention Module
CAM             Class activation map
SAM             Spatial attention map
LIST OF FIGURES
Figure 1.1 The basic exercises in functional recovery ........................................... 9
Figure 2.1 SENet architecture diagram ................................................................ 11
Figure 2.2 GCNet architecture diagram ............................................................... 12
Figure 2.3 DANet architecture diagram ............................................................... 13
Figure 2.4 CBAM architecture diagram .............................................................. 13
Figure 2.5 The ResNet model combined with the SAM ...................................... 14
Figure 2.6 The integrated model of the user's camera view direction ................. 15
Figure 2.7 Diagram of the Two-stream with spatiotemporal integration ............ 15
Figure 3.1 ResNet network architecture .............................................................. 17
Figure 3.2 A Residual block of ResNet ............................................................... 18
Figure 3.3 CAM architecture ............................................................................... 20
Figure 3.4 CBAM architecture ............................................................................. 21
Figure 3.5 PSAp architecture ............................................................................... 22
Figure 3.6 PSAs architecture ............................................................................... 22
Figure 3.7 Prior bounding box generation diagram ............................................. 24
Figure 4.1 Diagram of the proposed method ....................................................... 26
Figure 4.2 Wearable camera system .................................................................... 27
Figure 4.3 The method for hand detection and tracking ...................................... 27
Figure 4.4 The Hand-Tracking Result File from the Image Sequence ................ 29
Figure 4.5 The procedure for expanding the left/right hand from the bottom corner
of the frame .......................................................................................................... 30
Figure 4.6 The procedure for expanding the left/right hand from any position .. 30
Figure 4.7 Some post-processing results .............................................................. 31
Figure 4.8 ResNet model combined with Attention architecture ........................ 32
Figure 4.9 Application architecture ..................................................................... 33
Figure 4.10 General use case diagram ................................................................. 34
Figure 4.11 Use case diagram Show list of detected exercises ............................ 34
Figure 4.12 Use case diagram Play videos of patients' rehabilitation exercises .. 35
Figure 4.13 Use case diagram Statistics of the performance of exercises ........... 35
Figure 4.14 Use case diagram Analyze the video ................................................ 35
Figure 5.1 Chart representing Loss and Accuracy (ResNet34 + CBAM) ........... 37
Figure 5.2 Confusion Matrix chart of ResNet34 + CBAM ................................. 38
Figure 5.3 Chart representing Loss and Accuracy (ResNet34 + PSAp) .............. 39
Figure 5.4 Confusion Matrix chart of ResNet34 + PSAp .................................... 39
Figure 5.5 Chart representing Loss and Accuracy (ResNet34 + PSAs) .............. 40
Figure 5.6 Confusion Matrix chart of ResNet34 + PSAs .................................... 40
Figure 5.7 Heat map from ResNet34 + CBAM ................................................... 41
Figure 5.8 Heat map from ResNet34 + PSAp ...................................................... 42
Figure 5.9 Heat map from ResNet34 + PSAs ...................................................... 42
Figure 5.10 Model predictions (above), post-processing results (below) ............ 43
Figure 5.11 Model predictions (above), post-processing results (below) ............ 44
Figure 5.12 Main screen ....................................................................................... 46
LIST OF TABLES
Table 4.1 Statistics of hand tracking data ............................................................ 27
Table 4.2 Table summarizing the number of images per object in the exercises 31
Table 4.3 List of actors......................................................................................... 33
Table 4.4 List of use cases ................................................................................... 33
Table 5.1 Results of the training and testing set (ResNet34 + CBAM) ............... 38
Table 5.2 Results of the training and testing set (ResNet34 + PSAp) ................. 39
Table 5.3 Results of the training and testing set (ResNet34 + PSAs) ................. 40
Table 5.4 Results of IoU Evaluation .................................................................... 44
Table 5.5 Open video function event stream ....................................................... 45
Table 5.6 Video analysis function event stream .................................................. 45
Table 5.7 Visualize analysis results in table function event stream .................... 45
Table 5.8 Video Player function event stream ..................................................... 46
CHAPTER 1. INTRODUCTION
1.1 General background
Recognizing hand activities in egocentric vision is a research area that has
attracted considerable attention from the community. From an egocentric
perspective, the activities of individuals equipped with sensors are closely
tied to their daily routines, and hand activities play a particularly important
role in those routines. In recent years, wearable image sensor devices have been
widely used. Together with advances in machine learning techniques, this makes
activity recognition with sensor-equipped devices a feasible solution with
significant potential for assisting daily activities.
In the healthcare field, there have been many studies on wearable image
sensors for developing applications that support the treatment process, patient
healthcare, or assistance for the elderly and people with disabilities. One of
these research directions is the automatic recognition of activities of patients
recovering from strokes, accidents, or injuries, and the evaluation of their
recovery capabilities. Once the exercises that patients have practiced are
automatically labeled, more informative measures can be extracted: for instance,
their exercise capability and, consequently, their recovery progress, so that
the most appropriate treatment recommendations can be provided. This is
particularly beneficial in healthcare centers in Vietnam, where there is a large
number of patients and limited medical capacity. The approach proposed in this
thesis addresses this issue by using wearable cameras on patients to
automatically recognize and assess their rehabilitation exercises. As patients
change their positions and scenes during various exercises, the wearable camera
accurately captures what is in front of them; the camera's movements are guided
by the activities and attention of the wearer; hands and interacting objects
tend to appear at the center of the frame, and hand occlusion is minimized.
First-person vision offers many advantages compared to third-person
vision (TPV), where the camera's position is usually fixed and not attached to
the user. These inherent advantages have made the development of new approaches
to studying hand activities in rehabilitation exercises highly appealing.
However, with a first-person perspective, researchers must also deal with
significant issues: the image sensors are unstable, as they move along with the
user's body. For instance, rapid movements and abrupt lighting changes can
significantly degrade the quality of the collected data. Furthermore, the
continuous change of background scenes and the lack of training datasets,
particularly for hand poses, make traditional techniques unsuitable for
deploying such an application. As depicted in Figure 1.1, this thesis focuses on
the recognition of four rehabilitation exercises, using a ball, a cube, a water
bottle, and a cylinder, designed to train the hand function of clinically
treated patients.
Figure 1.1 The basic exercises in functional recovery
1.2 Objectives
In this thesis, we deal with attention-based schemes in the field of activity
recognition, both general human activity recognition and, specifically, hand
activity recognition. This is a vast and complex topic due to the diversity and
intricacy of human activities. The overarching goal is to develop automated
techniques for recognizing hand activities, applied to the recognition and
assessment of patients' rehabilitation exercises.
Specifically, the objectives of the thesis include:
- Investigating and constructing the database of rehabilitation exercises at
the Hanoi Medical University Hospital.
- Exploring and setting up attention-based schemes in hand activity recognition.
- Evaluating the attention-based architectures on different testing
datasets and on sequences of video activities for each specific exercise.
1.3 Thesis outline
The thesis is structured into six chapters:
Chapter 1. Introduction: Presents the context and rationale for selecting
the topic. This chapter provides an overview of the research area of the
thesis, specifically focusing on attention-based architectures in hand
activity recognition for identifying and assessing patients' rehabilitation
exercises.
Chapter 2. Related works: Presents the current state of research in this field.
This chapter will provide a comprehensive overview of current research and
the approaches that researchers are using.
Chapter 3. Fundamentals of neural networks: Presents the theoretical
foundation of the attention algorithms and models used in the thesis.
Chapter 4. Proposed method: Presents the approach and the problem-solving
process in detail. The concepts, analysis methods, data processing, and
neural network models that need to be built for the application are
presented and described.
Chapter 5. Experimental results: Experiments with and implements the
method presented in Chapter 4, provides the results of segmenting patients'
physical therapy exercises, builds the application interface, and evaluates
the effectiveness of the approach.
Chapter 6. Conclusion
CHAPTER 2. RELATED WORKS
In the problem of recognizing interactions between a person wearing a
sensor and objects, the person's hands are typically the main objects
appearing at the center of the frame in the camera wearer's field of view.
Therefore, the approach of this thesis is to apply attention-based
architectures to detect interaction regions between the hands and objects,
thereby recognizing the activities of the sensor wearer. In this chapter, the
work related to this approach is presented as follows:
2.1 Research on Attention-based architectures
The emergence of Attention-based architectures in deep learning has
significantly enhanced the effectiveness of many models, and it continues to be an
indispensable component in state-of-the-art models. In the paper "Squeeze-and-
Excitation Networks" [4] by authors Jie Hu, Li Shen, Samuel Albanie, Gang Sun,
and Enhua Wu, they proposed an attention architecture named SENet, which
represents a typical example of channel attention mechanisms. SENet is a
relatively simple network consisting of a few layers aimed at enhancing
information exchange between channels, thus improving the representation quality
of CNN models. SENet accomplishes this by utilizing all the information and then
selectively emphasizing the important features in each channel while paying less
attention to less important ones. SENet consists of two parts (Figure 2.1): the
Squeeze part and the Excitation part. The Squeeze part is responsible for gathering
global information using Global Average Pooling. The Excitation part is
responsible for creating channel-wise attention using two Fully Connected layers.
Using this technique, the authors achieved a top-5 error rate of 2.251% on the
ILSVRC 2017 dataset.
Figure 2.1 SENet architecture diagram
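The squeeze-and-excitation computation described above is compact enough to sketch directly. The snippet below is a minimal, illustrative NumPy version, where the toy weight matrices `w0` and `w1` stand in for the two learned fully connected layers (the real SENet operates inside a CNN on batched tensors):

```python
import numpy as np

def se_block(x, w0, w1):
    """Squeeze-and-Excitation applied to a feature map x of shape (C, H, W).

    w0 (shape C//r x C) and w1 (shape C x C//r) stand in for the two
    fully connected layers; here they are toy weights, not trained ones.
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: FC -> ReLU -> FC -> sigmoid, giving one weight per channel
    s = np.maximum(w0 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w1 @ s)))
    # Rescale: emphasize important channels, suppress less important ones
    return x * s[:, None, None]

# Toy usage: 8 channels, reduction ratio r = 2
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w0 = rng.standard_normal((4, 8)) * 0.1
w1 = rng.standard_normal((8, 4)) * 0.1
y = se_block(x, w0, w1)
print(y.shape)  # (8, 4, 4)
```

Since the sigmoid output lies in (0, 1), each channel is attenuated by its attention weight while the feature map shape is preserved.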
In the paper "GCNet: Non-local Networks Meet Squeeze-Excitation
Networks and Beyond" [5] by authors Yue Cao, Jiarui Xu, Stephen Lin, Fangyun
Wei, and Han Hu, they proposed an attention architecture named GCNet, which is
a representative example of channel attention mechanisms. The methods for
computing and aggregating contextual features used in GCNet are inherited from
NLNet [5]. GCNet consists of 3 modules: the Context modeling module aggregates
features from all positions of the channels in the input feature map to create a
shared contextual feature, the Feature transform module performs convolution to
capture inter-channel dependencies, and the Fusion module combines the shared
contextual feature into features at all positions in the input feature map, as shown
in Figure 2.2. Using this technique, the authors achieved a top-1 accuracy of
76.00% on the ImageNet dataset.
Figure 2.2 GCNet architecture diagram
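As a rough illustration of the three modules above, the following NumPy sketch computes a softmax-pooled global context, transforms it through a toy bottleneck, and adds it back to every position. Here `wk`, `w1`, and `w2` are hypothetical stand-ins for the 1×1 convolutions, and the LayerNorm used in the actual paper is omitted for brevity:

```python
import numpy as np

def gc_block(x, wk, w1, w2):
    """Simplified Global Context block on a feature map x of shape (C, H, W).

    wk (C,) plays the role of the 1x1 "key" convolution; w1 (C//r x C) and
    w2 (C x C//r) stand in for the two 1x1 convolutions of the transform
    step. All are illustrative toy weights.
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)
    # Context modeling: softmax over all spatial positions of a scalar key map
    k = wk @ flat                              # (H*W,)
    a = np.exp(k - k.max()); a /= a.sum()
    context = flat @ a                         # (C,) shared global context
    # Feature transform: bottleneck with ReLU to capture channel dependencies
    t = w2 @ np.maximum(w1 @ context, 0.0)
    # Fusion: broadcast-add the transformed context to every position
    return x + t[:, None, None]
```

Because the same context vector is added at every position, the block adds global information at a much lower cost than full pairwise non-local attention.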
In the paper "Dual Attention Network for Scene Segmentation" [6] by authors
Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing
Lu, the DANet architecture, a dual-attention mechanism, is proposed. It consists
of two main modules: the Position Attention Module and the Channel Attention
Module. To address the problem [6], the authors proposed a deep learning network
model as shown in Figure 2.3 with two independent attention modules, one for
Position attention and the other for Channel attention, and integrated the results
into the ResNet network. Using this approach, the authors achieved a Mean IoU
score of 81.5% on the Cityscapes test set without using coarse data.
Figure 2.3 DANet architecture diagram
In the paper CBAM: Convolutional Block Attention Module [2], the
authors Sanghyun Woo and colleagues proposed the CBAM attention architecture.
CBAM consists of two parts: Channel Attention and Spatial Attention. The input
feature maps undergo Channel Attention first and then proceed to Spatial
Attention, as depicted in Figure 2.4.
Figure 2.4 CBAM architecture diagram
2.2 Egocentric vision for activity recognition
Currently, several individual tasks in egocentric vision offer practical
applications in various aspects of life, including education, sports, healthcare, and
more. Alongside this, the diversity of approaches in analyzing image sequences is
demonstrated through the following related studies.
In the paper titled "Attention is All We Need: Nailing Down Object-centric
Attention for Egocentric Activity Recognition" [7], Swathikiran Sudhakaran and
colleagues built a DNN model for activity recognition by observing the attention
to object centers and their respective locations. Based on this, the authors
developed a spatial attention mechanism that allows the network to focus on
regions containing objects relevant to the observed activity. In this research, the
authors utilized a pre-trained ResNet-34 network [1] on ImageNet as the backbone
architecture. For each RGB input frame, the authors first computed the CAM [8]
by utilizing the class with the highest probability. Subsequently, the obtained CAM
was transformed into a probability map by applying softmax along the spatial
dimensions. The SAM was then multiplied with the output of the final
convolutional layer of ResNet-34, as depicted in Figure 2.5. Using this approach,
the authors achieved an accuracy of up to 63.79% on the GTEA 61 dataset.
Figure 2.5 The ResNet model combined with the SAM
In the paper "Integrating Human Gaze into Attention for Egocentric Activity
Recognition" [9] by authors Kyle Min and Jason J. Corso, the goal is to integrate
the user's gaze direction into the task of activity recognition in egocentric vision.
The authors employed a two-stream I3D [10] network as the backbone
architecture, as depicted in Figure 2.6. The input consists of two sequential streams
over time, including RGB frames and Optical flow frames. To model the gaze
distribution, they used the same convolutional blocks of I3D (Mixed 5b-c) and
added three additional convolutional layers on top of it. Through this technique,
the authors achieved a mean class accuracy of 62.84% on the EGTEA dataset and
64.81% on the GTEA Gaze+ dataset.
Figure 2.6 The integrated model of the user's camera view direction
In the research Learning Spatiotemporal Attention for Egocentric Action
Recognition [11], the authors made advancements in training spatiotemporal
features by applying 3D convolutions. Therefore, the authors proposed a simple
yet effective module to learn spatiotemporal attention in egocentric videos. The
model in the study uses a two-stream architecture, as depicted in Figure 2.7. Each
stream takes RGB/flow videos as inputs and has a spatiotemporal attention module
to generate attention maps. Through this technique, the authors achieved a Micro
accuracy of 68.6% on the EGTEA Gaze+ dataset.
Figure 2.7 Diagram of the Two-stream with spatiotemporal integration
2.3 Discussions
The approach of using attention architectures in the problem of egocentric
activity recognition has yielded promising results on various egocentric datasets.
However, egocentric activity recognition research faces both advantages and
challenges. In terms of advantages, egocentric videos capture critical aspects of
activities, provide diverse data, and can adapt to changes in lighting conditions and
overall scenes. This enables recognizing the scene in which the user is engaged,
inferring the user's state based on what they are paying attention to, and
determining object locations.
As for the challenges, several factors come into play. The fact that the camera
is not fixed makes it challenging to distinguish between the background and
foreground. Variations in lighting conditions require focusing on shape-related
features more than color. Real-time requirements and processing videos on
embedded devices add complexity.
To leverage the advantages and address these challenges, feature extraction
from images and videos is crucial. This involves selecting features related to shape
and motion, such as optical flow or aggregating frame-level features within
temporal windows. Moreover, finding suitable model architectures and parameter
tuning is a time-consuming process that requires extensive experimentation and
evaluation.
CHAPTER 3. FUNDAMENTALS OF NEURAL NETWORKS
3.1 Residual Network (ResNet) architecture
ResNet (Residual Network) is a breakthrough deep neural network
architecture introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun in 2015 [1]. ResNet achieved the top position in the ILSVRC 2015
competition with a top-5 error rate of only 3.57%. Moreover, it also ranked first in
the ILSVRC and COCO 2015 competitions in categories including ImageNet
Detection, ImageNet Localization, Coco Detection, and Coco Segmentation.
From the above, it can be seen that ResNet is one of the pivotal architectures
in the field of deep learning and has achieved significant success in tasks such as
image recognition and classification. Currently, there are various variants of the
ResNet architecture with different numbers of layers, such as ResNet-18,
ResNet-34, ResNet-50, ResNet-101, and ResNet-152; the number after the name
'ResNet' indicates how many layers the variant has. ResNet addresses the
degradation problem of traditional deep networks and can learn effectively
even with hundreds of layers.
Below is an overview of the structures of various versions of ResNet,
including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152.
Figure 3.1 ResNet network architecture
ResNet is a CNN designed to work with hundreds of layers. One issue
that arises when building a CNN with many convolutional layers is the
vanishing gradient problem, which leads to poor learning. The solution
proposed by ResNet is to use "skip" connections that bypass one or more
layers. Such a block is called a Residual block, as shown in the following
figure:
Figure 3.2 A Residual block of ResNet
ResNet is otherwise similar to networks consisting of convolution, pooling,
activation, and fully-connected layers. The image above displays a Residual
block used in the network. An arched arrow starts at the beginning of the
Residual block and ends at its end: the input X is added to the output of the
weight layers, which is the addition seen in the illustration. This operation
keeps the gradient from vanishing, because the identity path carries X through
unchanged. The short connections (skip connections) between layers allow
information to flow more easily, avoiding information loss and reducing the
vanishing gradient problem, which also enables the network to perform well
even on small training datasets. Therefore, this thesis uses a CNN with the
ResNet model for the activity recognition task.
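The core idea of the residual block can be sketched in a few lines. The snippet below is a toy illustration in which `f` stands for the block's weight layers (the real ResNet uses convolutions with batch normalization):

```python
import numpy as np

def residual_block(x, f):
    """A residual block: the output of the weight layers f(x) is added to
    the input x itself before the final ReLU, so the gradient always has a
    direct path back through the identity connection."""
    return np.maximum(f(x) + x, 0.0)

# If the weight layers learn f(x) = 0, the block reduces to the identity,
# which is exactly what makes very deep stacks easy to optimize.
x = np.array([1.0, 2.0, 3.0])
print(residual_block(x, lambda v: np.zeros_like(v)))  # [1. 2. 3.]
```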
3.2 Attention-based network architecture
The method of focusing on important regions and areas in an image
while disregarding unimportant regions is called an attention mechanism. In
Computer Vision, an attention mechanism is the process of selectively
weighting different features, through weight allocation, according to the
importance of the input data.
Mathematical representation:

Attention = f(g(x), x)

In which:
- g(x) is the process of generating attention, determining which part/region to
focus on.
- f(g(x), x) is the process of processing the input x based on the information
about which part/region is important, given by g(x).
3.2.1 Convolutional Block Attention Module (CBAM)
CBAM (Convolutional Block Attention Module) [2] is a specialized attention
mechanism designed to enhance the performance of CNNs in image processing.
CBAM Attention combines both Spatial Attention and Channel Attention to
enhance the ability to focus on important features in both the spatial and channel
dimensions of an image, as shown in Figure 2.4.
CBAM takes an intermediate feature map F ∈ R^(C×H×W) as input to the
Channel Attention Module, which produces a 1D channel attention map
M_c ∈ R^(C×1×1). The channel-refined feature map is then used as input to the
Spatial Attention Module, which produces a 2D spatial attention map
M_s ∈ R^(1×H×W).
The overall attention process can be summarized as follows:

F' = M_c(F) ⊗ F
F'' = M_s(F') ⊗ F'

In which:
- ⊗ represents element-wise multiplication (the attention maps are broadcast
accordingly).
- F'' is the final refined output of CBAM.
The Channel Attention Module focuses on selecting important features
within the channel information of the image, creating a channel attention map
by exploiting the relationships between feature channels. Each channel of a
feature map is considered a feature detector. According to [2], the spatial
information of the feature map is first aggregated with both average-pooling
and max-pooling, yielding two feature descriptors F_avg^c and F_max^c,
respectively. These two descriptors are then forwarded to a shared network to
generate the channel attention map M_c ∈ R^(C×1×1). The shared network is a
multi-layer perceptron (MLP) with one hidden layer; to reduce the number of
parameters, the size of the hidden activation is set to R^(C/r×1×1), where r is
the reduction ratio. After applying the shared network to each feature
descriptor, the output feature vectors are combined through element-wise
summation.
The channel attention is calculated as follows:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
       = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))   (1.3)

In which:
- σ is the sigmoid function.
- W_0 ∈ R^(C/r×C) and W_1 ∈ R^(C×C/r) are the weights of the shared MLP.
Figure 3.3 CAM architecture
The Spatial Attention Module is created by exploiting the spatial
relationships between features in the feature map. The process of generating a
Spatial Attention Map leverages the spatial relationships between spatial regions
in the Feature Map. Unlike channel attention, which focuses on "what" is an
important part of the information, spatial attention focuses on "where" is an
important part of the information, complementing channel attention. To calculate
spatial attention, we first apply average-pooling and max-pooling operations along
the channel axis and then combine them to create an effective feature descriptor.
In other words, the process involves performing average-pooling and max-
pooling operations separately on each channel of the feature map and then
combining them into a single feature descriptor. This feature descriptor contains
information about the importance of spatial regions in the feature map, helping to
determine "where" the important and useful locations are for the model during data
processing. Spatial attention supports focusing on specific spatial regions in the
data, complementing the capabilities of channel attention in identifying important
features by channel.
In the research [2], the authors aggregated channel-wise information of the
feature map with two different pooling operations, creating two distinct 2D
feature maps: F_avg^s ∈ R^(1×H×W) and F_max^s ∈ R^(1×H×W). These two feature
maps are then concatenated, and a standard convolutional layer is applied to
the concatenated result, generating the 2D spatial attention map, which is
calculated using the following formula:

M_s(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])) = σ(f^(7×7)([F_avg^s; F_max^s]))

In which:
- σ represents the sigmoid function.
- f^(7×7) denotes a convolution operation performed on the feature map using
a filter of size 7×7.
Figure 3.4 CBAM architecture
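Putting the two modules together, the following NumPy sketch mirrors the computation above. The matrices `w0` and `w1` are toy stand-ins for the shared MLP, and a fixed averaging kernel replaces the learned 7×7 convolution, so this illustrates the data flow rather than a trained CBAM:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_attention(x, w0, w1):
    """M_c: shared MLP over the avg- and max-pooled descriptors, shape (C,)."""
    mlp = lambda v: w1 @ np.maximum(w0 @ v, 0.0)
    return sigmoid(mlp(x.mean(axis=(1, 2))) + mlp(x.max(axis=(1, 2))))

def spatial_attention(x, k=7):
    """M_s: k x k convolution over the channel-pooled descriptor, shape (H, W).

    A fixed averaging kernel stands in for the learned 7x7 filter.
    """
    desc = np.stack([x.mean(axis=0), x.max(axis=0)])    # (2, H, W)
    pad = k // 2
    p = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))
    H, W = x.shape[1:]
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = p[:, i:i + k, j:j + k].mean()
    return sigmoid(out)

def cbam(x, w0, w1):
    """F'' = M_s(F') ⊗ F' with F' = M_c(F) ⊗ F."""
    f1 = x * channel_attention(x, w0, w1)[:, None, None]
    return f1 * spatial_attention(f1)[None, :, :]

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 6, 6))
w0 = rng.standard_normal((4, 8)) * 0.1
w1 = rng.standard_normal((8, 4)) * 0.1
print(cbam(x, w0, w1).shape)  # (8, 6, 6)
```

Note the order: channel attention is applied first and the spatial map is computed on the channel-refined features, matching the sequential arrangement in Figure 3.4.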
3.2.2 Polarized Self-Attention (PSA)
In the research [3], the group of authors proposed the Polarized Self-
Attention (PSA) architecture. The PSA architecture consists of two sub-modules:
the Channel-only Self-Attention module and the Spatial-only Self-Attention
module. PSA maximizes its representational capacity in both its Channel-only and
Spatial-only branches to the extent that there is only a slight numerical difference
between sequential (Figure 3.6) and parallel (Figure 3.5) structures.
Figure 3.5 PSAp architecture
Figure 3.6 PSAs architecture
Assuming that the input feature map for both the sequential and parallel
architectures is X ∈ R^(C×H×W):

Channel-only Self-Attention Module: with the input features as above, it
generates a channel attention map A^ch(X) ∈ R^(C×1×1) [3]:

A^ch(X) = F_SG[ W_z|θ1( σ1(W_v(X)) × F_SM(σ2(W_q(X))) ) ]   (2.1)

F_SM(X) = Σ_j (e^(x_j) / Σ_m e^(x_m)) · x_j   (2.2)

In which:
- W_v, W_q, and W_z are 1×1 convolution layers.
- σ1 and σ2 are two tensor reshaping operations.
- F_SM(·) is the SoftMax operator (2.2).
- × is the matrix dot-product operator.
- F_SG(·) is the Sigmoid operator.

The number of internal channels between the blocks W_v, W_q and W_z is C/2.
The output of the channel-only self-attention branch, with ⊙^ch denoting the
channel-wise multiplication operator, is:

Z^ch = A^ch(X) ⊙^ch X ∈ R^(C×H×W)   (2.3)
Spatial-only Self-Attention Module: with the same input as above, it
generates a spatial attention map A^sp(X) ∈ R^(1×H×W) [3]:

A^sp(X) = F_SG[ σ3( F_SM(σ1(F_GP(W_q(X)))) × σ2(W_v(X)) ) ]   (2.4)

F_GP(X) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} X(:, i, j)   (2.5)

In which:
- W_q and W_v are 1×1 convolution layers.
- σ1, σ2, and σ3 are three tensor reshaping operators.
- F_SM(·) is the SoftMax operator (2.2).
- × is the matrix dot-product operator.
- F_SG(·) is the Sigmoid operator.
- F_GP(·) is the Global Pooling operator (2.5).

The output of the spatial-only self-attention branch, with ⊙^sp denoting the
spatial multiplication operator, is:

Z^sp = A^sp(X) ⊙^sp X ∈ R^(C×H×W)   (2.6)
Composition: Under the parallel layout structure, the outputs of the two branches are added element-wise:

PSA_p(X) = Z^ch + Z^sp = A^ch(X) ⊙^ch X + A^sp(X) ⊙^sp X          (2.7)

Under the sequential layout structure, the two branches are composed, with the spatial branch applied to the output of the channel branch:

PSA_s(X) = Z^sp(Z^ch(X)) = A^sp(A^ch(X) ⊙^ch X) ⊙^sp (A^ch(X) ⊙^ch X)          (2.8)
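To make Eqs. (2.1)–(2.3) concrete, the channel-only branch can be sketched in PyTorch as follows. This is a simplified illustration: the layer normalization used in [3] is omitted, and the names are my own, not from the official implementation:

```python
import torch
import torch.nn as nn

class ChannelOnlySelfAttention(nn.Module):
    """Sketch of the PSA channel-only branch (Eqs. 2.1-2.3), simplified."""
    def __init__(self, channels: int):
        super().__init__()
        inner = channels // 2                    # C/2 internal channels
        self.wv = nn.Conv2d(channels, inner, 1)  # value projection W_v
        self.wq = nn.Conv2d(channels, 1, 1)      # query projection W_q
        self.wz = nn.Conv2d(inner, channels, 1)  # output projection W_z

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        v = self.wv(x).reshape(b, c // 2, h * w)               # sigma1(W_v(X))
        q = torch.softmax(self.wq(x).reshape(b, h * w, 1), 1)  # F_SM(sigma2(W_q(X)))
        z = torch.bmm(v, q).reshape(b, c // 2, 1, 1)           # dot product over positions
        attn = torch.sigmoid(self.wz(z))                       # A_ch(X), shape (b, c, 1, 1)
        return attn * x                                        # channel-wise multiplication

x = torch.randn(2, 64, 16, 16)
out = ChannelOnlySelfAttention(64)(x)
print(out.shape)  # torch.Size([2, 64, 16, 16])
```

The spatial-only branch (Eqs. 2.4–2.6) follows the same pattern with the roles of channels and spatial positions swapped.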
3.3 YOLO family object detection models
YOLO (You Only Look Once) is a computer vision object detection model first introduced by Joseph Redmon and colleagues in 2016. YOLOv2, developed by Joseph Redmon and Ali Farhadi, is an improved version of the original YOLO model and achieves higher accuracy and faster processing speed than its predecessor. The YOLOv2 model uses anchor boxes: pre-defined bounding boxes whose shapes and sizes are refined during training. The anchor shapes are selected by running the k-means clustering algorithm on the bounding boxes of the training dataset.
Importantly, the predicted bounding boxes are parameterized so that the network makes only minor adjustments to the anchors, resulting in a more stable model. Instead of directly predicting position and size, the network predicts offsets for the center coordinates, width, and height, which move and reshape the pre-defined anchor boxes at each grid cell through logistic functions.
Figure 3.7 Prior bounding box generation diagram
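The offset decoding described above can be sketched as a small function. The variable names (tx, ty, tw, th for the predicted offsets; pw, ph for the anchor prior) follow common YOLOv2 notation and are illustrative:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv2-style decoding: logistic offsets move/reshape an anchor at grid cell (cx, cy)."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = cx + sigmoid(tx)   # sigmoid keeps the center inside its grid cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)  # anchor width scaled exponentially
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Zero offsets leave the anchor centred in its cell with its prior size.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.0, ph=1.5))
# (3.5, 4.5, 2.0, 1.5)
```

Because the sigmoid bounds the center offsets to (0, 1), a large raw prediction can never throw the box into a distant cell, which is what stabilizes training.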
YOLO has become one of the most popular and effective object detection models in the field of computer vision. It has been widely used in various real-world applications, including object recognition, object tracking, and autonomous driving systems.
CHAPTER 4. PROPOSED METHOD
4.1 General framework
Based on the related research in the previous chapter, this chapter describes the proposed method for recognizing hand activities. The proposed framework is given in Fig. 4.1 and comprises the following main parts:
- Hand detection and tracking.
- Processing the b-box (bounding box) from the hand-tracking step, which is then input into a ResNet with integrated attention modules. ResNet combined with an attention module (e.g., CBAM, PSA) detects the regions of interest where the hand interacts with objects, thereby recognizing hand activities.
- Finally, deploying the application based on the proposed attention scheme.
Figure 4.1 Diagram of the proposed method
4.2 Collecting data
The proposed RehabHand dataset was collected at the Hanoi Medical University. It was gathered from wearable sensors, including a first-person-view camera, accelerometers, and gyroscopes, as set up in Figure 4.2. The data collection involved 10 patients undergoing functional recovery through four basic hand exercises (Figure 1.1): exercises with wooden blocks, with a ball, with a water bottle, and with a cylindrical block. These exercises were performed under the guidance of specialized physicians. The dataset was collected over a period of more than 4 hours, and the data modalities include video, accelerometer, and gyroscope data recorded in a real hospital environment.
Figure 4.2 Wearable camera system
4.3 Detecting and tracking the patient's hand
In this thesis, an object tracking technique is used to track hand activities. The goal is to establish the patient's hand regions as input for recognizing activities with attention-based architectures. The workflow for the hand tracking task (based on hand detection results) is illustrated in Figure 4.3.
Figure 4.3 The method for hand detection and tracking
The model tracks hand activities in egocentric videos by combining object
detection using YOLOv8 with the ByteTrack tracking technique. Table 4.1
provides statistics on the data used. A total of 6 video segments were labeled,
outlining all regions where hands appear in the frames.
Table 4.1 Statistics of hand tracking data

No | Segment video name      | Exercise | Number of frames | Number of hands/frame
1  | GH010354_5_17718_19366  | 1        | 1684             | 1
2  | GH010373_5_1284_2724    | 1        | 1440             | 5
3  | GH010358_6_10208_11900  | 2        | 1594             | 4
4  | GH010373_6_3150_4744    | 2        | 1692             | 6
5  | GH010358_7_2490_3390    | 3        | 900              | 4
6  | GH010358_8_8000_8547    | 4        | 547              | 3
Out of the 6 videos, approximately 7821 images were annotated manually,
with a training-to-testing ratio of 8:2 on the total number of images.
4.4 Processing the input data for the ResNet network from the Hand
tracking results
After tracking the hand activity in the consecutive image sequence using the
tracking technique, we obtain a file named <file.txt>, which contains the results
with values listed from left to right (Figure 4.4) corresponding to columns as
follows: <frame_id>, <id>, <bb_left>, <bb_top>, <bb_width>, <bb_height>,
<conf>, <x>, <y>, <z>.
Figure 4.4 The Hand-Tracking Result File from the Image Sequence
In which: <frame_id> represents the frame's order in the image sequence;
<id> indicates the tracked hand objects, including the left hand, right hand of the
patient, and the hand of the instructor; <bb_left>, <bb_top>, <bb_width>,
<bb_height> are the bounding box information.
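As an illustration, such a tracking result file can be parsed line by line. The sample values below are fabricated for the example, not taken from the actual RehabHand output:

```python
import csv
import io

# Two fabricated lines in the MOT-style layout described above:
# frame_id, id, bb_left, bb_top, bb_width, bb_height, conf, x, y, z
sample = ("1,2,604.0,431.0,389.0,324.0,0.93,-1,-1,-1\n"
          "2,2,610.0,428.0,392.0,330.0,0.91,-1,-1,-1\n")

tracks = []
for row in csv.reader(io.StringIO(sample)):
    frame_id, obj_id = int(row[0]), int(row[1])          # frame order, tracked hand id
    bb_left, bb_top, bb_w, bb_h = map(float, row[2:6])   # bounding box
    conf = float(row[6])                                 # detection confidence
    tracks.append((frame_id, obj_id, bb_left, bb_top, bb_w, bb_h, conf))

print(tracks[0])  # (1, 2, 604.0, 431.0, 389.0, 324.0, 0.93)
```

For a real file, `io.StringIO(sample)` would be replaced by `open("file.txt")`.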
Since the data was collected using a wearable camera system with hands as the primary objects, the direction from the arm to the hand tends to run from the left/right side of the frame towards its center. Based on the tracking information, I extended and cropped the frames using the bounding-box coordinates, adding 200 pixels to each of the width and height dimensions to capture information about the object the hand interacts with. The process is illustrated in Figure 4.5 and Figure 4.6.
Figure 4.5 The procedure for expanding the left/right hand from the bottom corner of
the frame
Figure 4.6 The procedure for expanding the left/right hand from any position
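The exact expansion direction in Figures 4.5 and 4.6 depends on the hand's position in the frame. As a simplified sketch, a symmetric variant that adds 200 pixels per dimension and clamps to the frame borders could look like this (the function name and the even split of the padding are my own assumptions):

```python
def expand_and_crop_box(bb_left, bb_top, bb_w, bb_h, frame_w, frame_h, pad=200):
    """Grow a tracked hand box by `pad` pixels per dimension, clamped to the frame."""
    x1 = max(0, int(bb_left - pad / 2))
    y1 = max(0, int(bb_top - pad / 2))
    x2 = min(frame_w, int(bb_left + bb_w + pad / 2))
    y2 = min(frame_h, int(bb_top + bb_h + pad / 2))
    return x1, y1, x2, y2

# A 300x250 box at (600, 400) in a Full HD frame grows by 100 px on each side.
print(expand_and_crop_box(600, 400, 300, 250, 1920, 1080))  # (500, 300, 1000, 750)
```

The crop itself would then be `frame[y1:y2, x1:x2]` on the decoded image array.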
Figure 4.7 Some post-processing results
From the data processing steps described above, I constructed a database used for training the ResNet34 model combined with attention architectures (CBAM, PSAs, PSAp). The database consists of 3973 processed images with labels for 5 classes: Ball, Water bottle, Hand, Cube, and Cylinder, as shown in Table 4.2.
Table 4.2 Number of training and validation images per object in the exercises

Class_id | Label        | Number of train images | Number of validation images | Image size
0        | Ball         | 207                    | 89                          | 224×224
1        | Water bottle | 715                    | 307                         | 224×224
2        | Hand         | 715                    | 307                         | 224×224
3        | Cube         | 446                    | 192                         | 224×224
4        | Cylinder     | 404                    | 174                         | 224×224
Sum      |              | 2778                   | 1195                        |
4.5 The activity recognition model using an Attention-based architecture
The proposed method utilizes a deep learning model to recognize the activities of individuals wearing sensors through the identification of regions of interest between the hand and objects. ResNet34 serves as the backbone of the entire architecture and is integrated with attention architectures such as CBAM, PSAs, and PSAp. These attention architectures are added to the Residual Block as shown in Figure 4.8.
Figure 4.8 ResNet model combined with Attention architecture
From training and testing on the processed dataset described in Table 4.2 and on the video sequences of the exercises, I evaluate the CBAM, PSAs, and PSAp architectures and choose the most suitable attention architecture, within the scope of this thesis, for the egocentric activity recognition task.
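A minimal PyTorch sketch of this integration, with the attention module applied to the residual path before the skip connection (one common placement; the thesis's exact insertion point may differ), could look like:

```python
import torch
import torch.nn as nn

class AttnBasicBlock(nn.Module):
    """ResNet-34 basic block with an attention module before the skip addition."""
    def __init__(self, channels: int, attention: nn.Module):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.attention = attention               # e.g. a CBAM or PSA module
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.attention(out)                # re-weight features on the residual path
        return self.relu(out + x)                # identity shortcut

# nn.Identity() stands in for a real CBAM/PSA module in this self-contained demo.
block = AttnBasicBlock(64, attention=nn.Identity())
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Because the attention module preserves the feature-map shape, it can be dropped into every basic block of a torchvision ResNet34 without changing the rest of the network.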
4.6 Deploying the application for detecting the rehabilitation exercises
4.6.1 Application architecture
The following architectural model describes the components that make up the application.
Figure 4.9 Application architecture
4.6.2 List of actors
Table 4.3 List of actors
No | Actor  | Description
1  | Doctor | Doctor in charge of treating patients undergoing rehabilitation
4.6.3 List of use cases
Table 4.4 List of use cases
No | Use Case                                          | Description
1  | Show list of detected exercises                   | Displays the list of detected exercises as a tree table; exercises are segmented by the "Start time" and "Stop time" columns
2  | Play videos of patients' rehabilitation exercises | Plays a video recording the patient's performance of the rehabilitation exercises
3  | Statistics of the performance of exercises        | Statistics of the duration and frequency of exercises
4  | Analyze the video                                 | Analyzes the video and saves the result to a CSV file
Figure 4.10 General use case diagram
4.6.3.1 Show list of detected exercises
Figure 4.11 Use case diagram Show list of detected exercises
a. Summary
The doctor uses this use case to view the list of detected exercises in the video and jump to an exercise on the time bar.
b. Line of events

Actor's actions                 | Application response
1. Click on the Expand arrow    | Displays the detected exercises for the corresponding time period
2. Click on the Collapse arrow  | Collapses the detected exercises
3. Click on a row               | Jumps playback to the corresponding time
4.6.3.2 Play videos of patients' rehabilitation exercises
Figure 4.12 Use case diagram Play videos of patients' rehabilitation exercises
a. Summary
The doctor uses this use case to view the progress of the patient's exercise in the video.
b. Line of events

Actor's actions                    | Application response
1. Select a local video file       | Plays the selected video
2. Click on the "Play" button      | Plays the video
3. Click on the "Pause" button     | Pauses the video
4. Click on the "Forward" button   | Fast-forwards the video
5. Click on the "Backward" button  | Rewinds the video
6. Click on the time bar           | Jumps playback to the corresponding time
4.6.3.3 Statistics of the performance of exercises
Figure 4.13 Use case diagram Statistics of the performance of exercises
a. Summary
The doctor uses this use case to see statistics such as the duration and frequency of exercises.
b. Line of events
N/A
4.6.3.4 Analyze the video
Figure 4.14 Use case diagram Analyze the video
a. Summary
The doctor uses this use case to record the detected results to a CSV file.
b. Line of events

Actor's actions                   | Application response
1. Click on the "Analyze" button  | Analyzes the video, displays the exercise detection results, exports the results to a CSV file, and notifies the user of success
CHAPTER 5. EXPERIMENTS
5.1 Environment and model evaluation metrics
The models and algorithms of this thesis are programmed and trained using the Python programming language and the PyTorch library on computers with GeForce GTX 1080 Ti GPU cards. Classification accuracy and the confusion matrix are utilized to evaluate the proposed method. The application is developed using the Python language and the PyQt6 library.
5.2 Quantitative Evaluation
The recognition results of the ResNet34 + CBAM model:
Figure 5.1 Chart representing Loss and Accuracy (ResNet34 + CBAM)
Figure 5.2 Confusion Matrix chart of ResNet34 + CBAM
Table 5.1 Results of the training and testing set (ResNet34 + CBAM)
Class_name   | Precision | Recall | F1-Score | Accuracy (%)
Ball         | 0.98      | 0.96   | 0.97     | 95
Water bottle | 0.97      | 0.99   | 0.98     | 99
Hand         | 0.95      | 0.87   | 0.91     | 87
Cube         | 0.98      | 0.99   | 0.98     | 99
Cylinder     | 0.95      | 0.98   | 0.96     | 99
Based on the table above, the Precision, Recall, and F1-Score are high for all 5 classes. The recognition rate of hand activities with objects is greater than 85%, so the model performs relatively well.
The ResNet34 + PSAp model
Figure 5.3 Chart representing Loss and Accuracy (ResNet34 + PSAp)
Figure 5.4 Confusion Matrix chart of ResNet34 + PSAp
Table 5.2 Results of the training and testing set (ResNet34 + PSAp)
Class_name   | Precision | Recall | F1-Score | Accuracy (%)
Ball         | 0         | 0      | 0        | 0
Water bottle | 0.99      | 0.99   | 0.99     | 99
Hand         | 0.78      | 0.90   | 0.84     | 90
Cube         | 0.97      | 0.96   | 0.96     | 96
Cylinder     | 0.94      | 0.99   | 0.96     | 99
Based on the results and the table above, we can see that the model does not
recognize hand activities interacting with the Ball object. The model's performance
is not good.
The ResNet34 + PSAs model
Figure 5.5 Chart representing Loss and Accuracy (ResNet34 + PSAs)
Figure 5.6 Confusion Matrix chart of ResNet34 + PSAs
Table 5.3 Results of the training and testing set (ResNet34 + PSAs)
Class_name   | Precision | Recall | F1-Score | Accuracy (%)
Ball         | 0.97      | 0.97   | 0.97     | 97
Water bottle | 0.99      | 0.99   | 0.99     | 99
Hand         | 0.97      | 0.89   | 0.93     | 87
Cube         | 0.97      | 0.99   | 0.98     | 99
Cylinder     | 0.96      | 0.99   | 0.97     | 99
Based on the results and the table above, we can see that the Precision, Recall,
and F1-Score are high for all 5 objects. The recognition rate of hand activities with
objects is over 85%. The proposed model performs well on the practical testing
data.
Discussions:
Based on the evaluation results using Precision, Recall, F1-Score, and
Accuracy, we can observe that the ResNet34 model combined with CBAM and
PSAs performs well in recognizing hand activities with all 4 objects achieving very
good results. However, the ResNet34 model combined with PSAp fails to
recognize hand activities involving the Ball object, indicating its lower
effectiveness.
5.3 Visualizing the evaluation results
Figure 5.7 Heat map from ResNet34 + CBAM
Figure 5.8 Heat map from ResNet34 + PSAp
Figure 5.9 Heat map from ResNet34 + PSAs
Discussion: From the heat map diagrams in Figures 5.7, 5.8, and 5.9, it can be observed that the ResNet34 model combined with PSAs performs best. The attention regions generated by this model effectively encompass the interaction area between the hand and objects in the rehabilitation exercises.
5.4 Evaluating the results on the collected videos
From the quantitative and visual evaluation results, it is evident that the
ResNet34 model combined with PSAs achieves the best performance. Therefore,
I will employ ResNet34 combined with PSAs to assess exercises in videos using
43
post-processing methods with sequence prediction results and the Intersection over
Union (IoU) metric as follows:
Post-processing method for sequence prediction results: The prediction results use Class_id 0, 1, 2, 3, 4, corresponding to the ball exercise, water bottle exercise, no interaction, box exercise, and cylinder exercise, as shown in Table 3.2. In a sequence of images from an exercise video, for every N consecutive frames, the class_id with the highest occurrence among the per-frame predictions of the ResNet34 + PSAs model is taken to represent the activity performed in those N frames, as depicted in Figure 5.10.
Figure 5.10 Model predictions (above), post-processing results (below)
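A minimal sketch of this majority-vote smoothing, assuming non-overlapping windows of N frames (the window size and the tie-breaking rule of `most_common` are illustrative assumptions, not the thesis's exact settings):

```python
from collections import Counter

def smooth_predictions(preds, n=15):
    """Replace each window of n per-frame class_ids by its most frequent class_id."""
    smoothed = []
    for start in range(0, len(preds), n):
        window = preds[start:start + n]
        majority = Counter(window).most_common(1)[0][0]  # majority vote in the window
        smoothed.extend([majority] * len(window))
    return smoothed

# Isolated misclassifications (the 2s) are voted away within each window.
preds = [0, 0, 2, 0, 0, 1, 1, 2, 1, 1]
print(smooth_predictions(preds, n=5))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

This turns a noisy per-frame label sequence into contiguous activity segments that can then be scored with IoU.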
Calculate the Intersection over Union (IoU) index for hand activities:
The thesis calculates the Intersection over Union (IoU) index for each frame
sequence in the exercise videos to measure the similarity between the post-
processed results and the ground truth labels, as depicted in Figure 5.11.
44
Figure 5.11 Model predictions (above), post-processing results (below)
The IoU calculation formula:

IoU = (Prediction ∩ GroundTruth) / (Prediction ∪ GroundTruth)

The formula to calculate the average IoU for an exercise in a video, where N is the number of frame sequences of that exercise:

IoU_avg = (1/N) Σ_{i=1}^{N} IoU_i
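For frame-index intervals, the two formulas above can be sketched directly; representing a segment as (start, end) with the end exclusive is my own convention here:

```python
def temporal_iou(pred, gt):
    """IoU of two frame intervals given as (start_frame, end_frame), end exclusive."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))     # overlap length
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter         # combined length
    return inter / union if union > 0 else 0.0

def average_iou(pairs):
    """Mean IoU over a list of (prediction, ground-truth) interval pairs."""
    return sum(temporal_iou(p, g) for p, g in pairs) / len(pairs)

print(temporal_iou((10, 30), (20, 40)))  # 0.3333333333333333
```

The same computation extends to per-frame label sequences by treating each contiguous run of one class as an interval.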
Table 5.4 Results of IoU Evaluation

No | Segment video name           | Number of frames | Exercise     | Average IoU
1  | GH010354_5_17718_19366_Cau2  | 760              | Ball         | 0.45
   |                              |                  | Hand         | 0.43
2  | GH010354_5_17718_19366_Cau   | 152              | Ball         | 0.66
   |                              |                  | Hand         | 0.11
3  | GH010373_6_3150_4744         | 139              | Water bottle | 0.60
4  | GH010382_6_18190_20215_2     | 150              | Water bottle | 0.68
Discussion: Through the evaluation on consecutive image sequences with exercises involving water bottles and balls, the model's segmentation of the activities still does not perform well.
5.5 Application
5.5.1 Application functions
5.5.1.1 Open video
This function allows the doctor to select a video and its analysis results from the computer to view on the screen.
Table 5.5 Open video function event stream
Actor's actions                      | Application response
1. Click on the Open Video button    | Shows a file dialog to choose the directory containing the video
2. Choose a directory                | Loads the video and the analysis results (if any) on the screen
5.5.1.2 Video Analysis
This function allows the doctor to perform analysis to segment the patient's
exercises from the opened video.
Table 5.6 Video analysis function event stream
Actor's actions                        | Application response
1. Click on the Analysis Video button  | Segments the patient's exercises in the video, then displays the analysis result on the screen
5.5.1.3 Visualize analysis results in table
This function displays analysis results in table form, each record
corresponding to a time point in the video.
Table 5.7 Visualize analysis results in table function event stream
Actor's actions     | Application response
1. Click on a row   | Jumps to the corresponding frame in the video player
5.5.1.4 Statistics of analysis results in chart
This function summarizes the video analysis results in chart form; the statistical metrics include:
- Number of repetitions of each exercise;
- Exercise duration of each exercise;
- Ratio of exercise performance between left and right hands.
5.5.1.5 Video Player
This function allows doctors to view patient exercise videos. Exercise
segments are displayed on the bar with different colors corresponding to different
exercises. The doctor can forward, backward, or jump to the desired frame.
Table 5.8 Video Player function event stream
Actor's actions              | Application response
1. Click the Start button    | Plays or resumes the video
2. Click the Stop button     | Pauses the video
3. Click the Forward button  | Fast-forwards the video
4. Click the Backward button | Rewinds the video
5. Click on the status bar   | Jumps to the corresponding frame
5.5.2 Design application UI
Figure 5.12 Main screen
1. Toolbar with functions such as:
a. Open the video
b. Video analysis
c. Close the application
2. Table of patient exercise segmentation results
3. Statistical chart of results
4. Video player. A demonstration video is available at: https://youtu.be/d1WIKbcsj-g
CHAPTER 6. CONCLUSION
In this thesis, I have deployed, evaluated, and tested models for recognizing the activities of persons wearing sensors. In particular, the proposed method targets interactions between hands and objects on the RehabHand dataset. The thesis has successfully accomplished the following tasks:
- Investigated the rehabilitation (PHCN) exercise database at the hospital.
- Deployed the technique for patient hand detection and tracking using the YOLOv8 and ByteTrack models.
- Showed, through quantitative and visual evaluation, that the ResNet34 network combined with the PSAs module gives the best results for recognizing the activities of individuals with wearable sensors.
However, the thesis still has some limitations:
- The feature extraction is not yet optimized.
- The data processing after tracking is not ideal, as it depends on the image cropping process.
- The program execution time is still slow and cannot achieve real-time processing.
- The activity recognition rate degrades on frames that are blurry due to fast movements of the patient's hand.
Given the achieved results and the existing limitations, the thesis can be further improved and developed in the following directions:
- Improve the attention region localization, especially in cases where frames contain multiple objects that need to be recognized.
- Expand the dataset for evaluating patient exercises.
- Enhance the model so that it can run in real time.
Finally, during the process of writing this thesis, I have gained much experience, such as in analyzing and processing image data. At the same time, I have learned to use powerful tools and libraries for machine learning such as Matplotlib and OpenCV, as well as the Python language. These experiences will give me the necessary skills for my work after graduation.
REFERENCES
[1] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image
Recognition," in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016, pp. 770-778.
[2] S. Woo, J. Park, J.-Y. Lee and I. S. Kweon, "CBAM: Convolutional Block Attention Module," in European Conference on Computer Vision (ECCV), 2018.
[3] H. Liu, F. Liu, X. Fan and D. Huang, "Polarized self-attention: Towards high-quality pixel-wise mapping," Neurocomputing, vol. 506, pp. 158-167, 2022.
[4] J. Hu, L. Shen and G. Sun, "Squeeze-and-Excitation Networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 7132-7141.
[5] Y. Cao, J. Xu, S. Lin, F. Wei and H. Hu, "GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond," in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019, pp. 1971-1980.
[6] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang and H. Lu, "Dual Attention Network for Scene Segmentation," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3141-3149.
[7] S. Sudhakaran and O. Lanz, "Attention is All We Need: Nailing Down
Object-centric Attention for Egocentric Activity Recognition," ArXiv, vol.
abs/1807.11794, 2018.
[8] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva and A. Torralba, "Learning Deep
Features for Discriminative Localization," in 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA,
2016, pp. 2921-2929.
[9] K. Min and J. J. Corso, "Integrating Human Gaze into Attention for Egocentric Activity Recognition," in 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 1068-1077.
[10] J. Carreira and A. Zisserman, "Quo Vadis, Action Recognition? A New
Model and the Kinetics Dataset," in 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2017, pp. 4724-4733.
[11] M. Lu, D. Liao and Z.-N. Li, "Learning Spatiotemporal Attention for
Egocentric Action Recognition," in 2019 IEEE/CVF International
Conference on Computer Vision Workshop (ICCVW), 2019, pp. 4425-4434.